Large-scale ligand-based predictive modelling using support vector machines
نویسندگان
چکیده
The increasing size of datasets in drug discovery makes it challenging to build robust and accurate predictive models within a reasonable amount of time. In order to investigate the effect of dataset sizes on predictive performance and modelling time, ligand-based regression models were trained on open datasets of varying sizes of up to 1.2 million chemical structures. For modelling, two implementations of support vector machines (SVM) were used. Chemical structures were described by the signatures molecular descriptor. Results showed that for the larger datasets, the LIBLINEAR SVM implementation performed on par with the well-established libsvm with a radial basis function kernel, but with dramatically less time for model building even on modest computer resources. Using a non-linear kernel proved to be infeasible for large data sizes, even with substantial computational resources on a computer cluster. To deploy the resulting models, we extended the Bioclipse decision support framework to support models from LIBLINEAR and made our models of logD and solubility available from within Bioclipse.
منابع مشابه
STAGE-DISCHARGE MODELING USING SUPPORT VECTOR MACHINES
Establishment of rating curves are often required by the hydrologists for flow estimates in the streams, rivers etc. Measurement of discharge in a river is a time-consuming, expensive, and difficult process and the conventional approach of regression analysis of stage-discharge relation does not provide encouraging results especially during the floods. P
متن کاملScaling Factorization Machines to Relational Data
The most common approach in predictive modeling is to describe cases with feature vectors (aka design matrix). Many machine learning methods such as linear regression or support vector machines rely on this representation. However, when the underlying data has strong relational patterns, especially relations with high cardinality, the design matrix can get very large which can make learning and...
متن کاملIdentification and Adaptive Position and Speed Control of Permanent Magnet DC Motor with Dead Zone Characteristics Based on Support Vector Machines
In this paper a new type of neural networks known as Least Squares Support Vector Machines which gained a huge fame during the recent years for identification of nonlinear systems has been used to identify DC motor with nonlinear dead zone characteristics. The identified system after linearization in each time span, in an online manner provide the model data for Model Predictive Controller of p...
متن کاملMining Biological Repetitive Sequences Using Support Vector Machines and Fuzzy SVM
Structural repetitive subsequences are most important portion of biological sequences, which play crucial roles on corresponding sequence’s fold and functionality. Biggest class of the repetitive subsequences is “Transposable Elements” which has its own sub-classes upon contexts’ structures. Many researches have been performed to criticality determine the structure and function of repetitiv...
متن کاملکاربرد الگوریتمهای دادهکاوی در تفکیک منابع رسوبی حوزۀ آبخیز نوده گناباد
Introduction: Reduction of sediment supply requires the implementation of soil conservation and sediment control programs in the form of watershed management plans. Sediment control programs require identifying the relative importance of sediment sources, their quantitative ascription and identification of critical areas within the watersheds. The sediment source ascription is involves two...
متن کامل